Exploratory Data Analysis: Paris trees

Fedi GHANMI

In this notebook, I will:

  • Present the data I have in hand, state my assumptions and explore the data.
  • Choose an angle of study and explain phenomena to reach a conclusion.
  • You can clone and run this notebook to see the visualizations that come with the explanations.
In [1]:
# Package import

import pandas as pd
from dataprep.eda import *
import re
import plotly.express as px
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import pypandoc

# ignore warnings library
import warnings
warnings.filterwarnings("ignore")
In [2]:
# Small helper functions for preprocessing

def int_regex(text):
    """Keep only the digits of a string, e.g. 'PARIS 13E ARRDT' -> '13'."""
    return re.sub(r'\D', '', text)


def get_rounding(ville):
    """Get the district (arrondissement) number from a non-numeric place name."""
    if ville == "BOIS DE VINCENNES":
        return 12

    elif ville in ("HAUTS-DE-SEINE", "VAL-DE-MARNE", "SEINE-SAINT-DENIS"):
        return 3

    elif ville == "BOIS DE BOULOGNE":
        return 16

    else:
        # Any other label; the later astype(int) cannot convert this string
        return 'Missing'

Load the Data and Perform the Needed Preprocessing

  • Load the data in CSV format.
  • Delete non-informative columns.
  • Split the geo_point_2d column into a latitude column and a longitude column.
  • Preprocess the "ARRONDISSEMENT" column so that it contains only numbers.
In [3]:
# Data loading
paris_trees_set1 = pd.read_csv("data_csv/paris_trees_set1.csv", low_memory=False)
paris_trees_set2 = pd.read_csv("data_csv/paris_trees_set2.csv", low_memory=False)

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two sets the same way
paris_trees = pd.concat([paris_trees_set1, paris_trees_set2], ignore_index=True)
claims = pd.read_csv("data_csv/dans-ma-rue.csv", low_memory=False, sep=";")

We have 11 columns and more than 200k observations, which correspond to more than 200k trees planted across Paris.

  • We will assume that this is the most up-to-date data on Paris trees, since the data source indicates 29 September 2022 as the last modification date.
  • We will assume that the data posted by the source is accurate and reflects the distribution of trees in Paris.
In [4]:
paris_trees.head()
Out[4]:
| | IDBASE | TYPE EMPLACEMENT | DOMANIALITE | ARRONDISSEMENT | COMPLEMENT ADRESSE | NUMERO | LIEU / ADRESSE | IDEMPLACEMENT | LIBELLE FRANCAIS | GENRE | ESPECE | VARIETE OUCULTIVAR | CIRCONFERENCE (cm) | HAUTEUR (m) | STADE DE DEVELOPPEMENT | REMARQUABLE | geo_point_2d |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2007514 | Arbre | Alignement | BOIS DE VINCENNES | NaN | NaN | ROUTE DU PESAGE | 000101017 | Platane | Platanus | x hispanica | NaN | 205 | 27 | Adulte | NON | 48.82496365567108,2.4461391819662435 |
| 1 | 2031959 | Arbre | Alignement | PARIS 13E ARRDT | NaN | NaN | RUE EUGENE FREYSSINET | 000104004 | Aulne | Alnus | incana | ''Aurea'' | 20 | 2 | Jeune (arbre) | NON | 48.83339630750579,2.3709812332986244 |
| 2 | 151442 | Arbre | CIMETIERE | HAUTS-DE-SEINE | NaN | NaN | CIMETIERE DE BAGNEUX / AVENUE DE L''AULNAIE / ... | A03200096004 | Aulne | Alnus | cordata | NaN | 0 | 0 | NaN | NaN | 48.8025849381091,2.3080238449792954 |
| 3 | 279207 | Arbre | Alignement | PARIS 13E ARRDT | NaN | NaN | RUE THOMIRE | 000202002 | Erable | Acer | platanoides | ''Crimson King'' | 75 | 15 | Adulte | NON | 48.81975318970025,2.348187507873684 |
| 4 | 291674 | Arbre | Alignement | PARIS 1ER ARRDT | 6 | NaN | RUE DU COLONEL DRIANT | 000202009 | Chêne | Quercus | robur | ''Fastigiata'' | 90 | 10 | Jeune (arbre)Adulte | NON | 48.86334886328951,2.340765942541643 |
In [5]:
# Delete non-informative columns

non_info_cols_paris = ["IDBASE", "TYPE EMPLACEMENT", "COMPLEMENT ADRESSE",
                      "NUMERO", "IDEMPLACEMENT", "REMARQUABLE"]
non_info_cols_claims = ["ID DECLARATION", "TYPE DECLARATION", "SOUS TYPE DECLARATION",
                       "VILLE", "DATE DECLARATION", "OUTIL SOURCE", "INTERVENANT",
                       "ID_DMR", "geo_shape", "mois_annee_decla"]

paris_trees.drop(non_info_cols_paris, inplace=True, axis = 1)
claims.drop(non_info_cols_claims, inplace = True, axis = 1)
In [6]:
# Split "geo_point_2d" ("lat,lon") of the claims into two float columns

claims['latitude'] = claims['geo_point_2d'].apply(
    lambda x: float(x[0:x.find(",")]) if not pd.isnull(x) else x)
# Slice to the end of the string so that no digit of the longitude is lost
claims['longitude'] = claims['geo_point_2d'].apply(
    lambda x: float(x[x.find(",")+1:]) if not pd.isnull(x) else x)
In [7]:
# Split "geo_point_2d" ("lat,lon") of the trees into two float columns

paris_trees['latitude'] = paris_trees['geo_point_2d'].apply(
    lambda x: float(x[0:x.find(",")]) if not pd.isnull(x) else x)
# Slice to the end of the string so that no digit of the longitude is lost
paris_trees['longitude'] = paris_trees['geo_point_2d'].apply(
    lambda x: float(x[x.find(",")+1:]) if not pd.isnull(x) else x)
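As an alternative to slicing around the comma, the same split can be sketched with pandas string methods. This is only an illustration on a toy frame (the name `toy` is hypothetical; the two coordinate strings are taken from the `head()` output above), not the notebook's own method.

```python
import pandas as pd

# Toy frame standing in for paris_trees; the last row mimics a missing value.
toy = pd.DataFrame({"geo_point_2d": ["48.82496365567108,2.4461391819662435",
                                      "48.83339630750579,2.3709812332986244",
                                      None]})

# Split once on the comma into two columns, then convert each part to float.
# Rows with a missing geo_point_2d stay missing (NaN) after the conversion.
parts = toy["geo_point_2d"].str.split(",", expand=True)
toy["latitude"] = pd.to_numeric(parts[0])
toy["longitude"] = pd.to_numeric(parts[1])

print(toy[["latitude", "longitude"]])
```

Splitting on the separator avoids any index arithmetic, so no character of either coordinate can be dropped by accident.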
In [8]:
# Clean the "ARRONDISSEMENT" column.

# Labels containing digits ("PARIS 13E ARRDT") are reduced to those digits
paris_trees['ARRONDISSEMENT'] = paris_trees['ARRONDISSEMENT'].apply(
    lambda x: int_regex(x) if any(ch.isdigit() for ch in x) else x)
# Labels without digits ("BOIS DE VINCENNES", ...) are mapped to a district number
paris_trees['ARRONDISSEMENT'] = paris_trees['ARRONDISSEMENT'].apply(
    lambda x: get_rounding(x) if not any(ch.isdigit() for ch in x) else x)

paris_trees["ARRONDISSEMENT"] = paris_trees["ARRONDISSEMENT"].astype(int)
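The cleaning step above can also be sketched in a vectorized way. The lookup table below simply restates the values returned by `get_rounding`; the names `NON_NUMERIC_LABELS`, `labels` and `numbers` are hypothetical, and the nullable `Int64` dtype is my own choice so that an unknown label would become `<NA>` rather than a string.

```python
import pandas as pd

# Same mapping as get_rounding, written as a lookup table.
NON_NUMERIC_LABELS = {
    "BOIS DE VINCENNES": 12,
    "BOIS DE BOULOGNE": 16,
    "HAUTS-DE-SEINE": 3,
    "VAL-DE-MARNE": 3,
    "SEINE-SAINT-DENIS": 3,
}

labels = pd.Series(["PARIS 13E ARRDT", "PARIS 1ER ARRDT",
                    "BOIS DE VINCENNES", "HAUTS-DE-SEINE"])

# First group of digits in each label; NaN when the label has no digit.
digits = pd.to_numeric(labels.str.extract(r"(\d+)")[0])

# Fall back to the lookup table where no digit was found.
numbers = digits.fillna(labels.map(NON_NUMERIC_LABELS)).astype("Int64")

print(numbers.tolist())
```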

Start Analysis

  • Describe some statistical features of the variables.
  • Search for meaning and insight.
In [9]:
# Plot Box plot
plot(paris_trees, "latitude" , display=["Box Plot"])
Out[9]:
[DataPrep.EDA report: box plot of latitude]
In [10]:
# Plot Box plot
plot(paris_trees, "longitude", display=["Box Plot"])
Out[10]:
[DataPrep.EDA report: box plot of longitude]

Description: These two box plots show the latitude and longitude ranges of our data. Latitude goes from 48.76 to 48.91 and longitude from 2.21 to 2.47. These ranges correspond to the extent of the Paris districts (arrondissements), from the 1st to the 20th. The 25% and 75% quantiles of latitude are 48.83 and 48.87; the longitude box starts at about 2.3 and, given the 2.47 maximum, must end below that value. Since each observation is a tree, about half of our trees lie within each of these bands. But what are these coordinates exactly?
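The quartiles read off a box plot can be checked numerically. The following is a minimal sketch on five latitudes copied from the `head()` output above (the real column has more than 200k values, so the numbers will differ); the variable names are hypothetical.

```python
import pandas as pd

# Five latitudes from the first rows of the data set, used as a toy sample.
lat = pd.Series([48.82496365567108, 48.83339630750579, 48.8025849381091,
                 48.81975318970025, 48.86334886328951], name="latitude")

# Lower and upper edges of the box in a box plot.
q1, q3 = lat.quantile([0.25, 0.75])

# Share of observations that fall inside the box (inclusive bounds).
inside = lat.between(q1, q3).mean()

print(f"Q1={q1:.4f}  Q3={q3:.4f}  share inside the box={inside:.0%}")
```

On a large sample the share inside the box approaches 50% for each variable taken separately; the share of trees inside both the latitude box and the longitude box at once can be smaller.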

In [11]:
# Plot Viz
plot(paris_trees, "DOMANIALITE",  display=["Bar Chart", "Value Table"])
Out[11]:
[DataPrep.EDA report: bar chart of DOMANIALITE, with value table]

| Value | Count | Frequency (%) |
|---|---|---|
| Alignement | 106206 | 51.6% |
| Jardin | 48933 | 23.8% |
| CIMETIERE | 31894 | 15.5% |
| DASCO | 7254 | 3.5% |
| PERIPHERIQUE | 5259 | 2.6% |
| DJS | 4726 | 2.3% |
| DFPE | 1364 | 0.7% |
| DAC | 61 | < 0.1% |
| DASES | 39 | < 0.1% |

Description: More than 200 thousand trees are recorded in Paris. The bar chart shows that "Alignement" (street-alignment) trees are by far the most abundant, with more than 100 thousand trees, or more than 50% of all recorded trees. Next come "Jardin" (garden) and "CIMETIERE" (cemetery) trees, which together account for around 40% of the total.
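The percentages quoted above can be recomputed from the counts in the value table. This sketch assumes the nine rows listed are the full set of categories; the names `counts` and `share` are hypothetical.

```python
import pandas as pd

# Counts copied from the DOMANIALITE value table.
counts = pd.Series({
    "Alignement": 106206, "Jardin": 48933, "CIMETIERE": 31894,
    "DASCO": 7254, "PERIPHERIQUE": 5259, "DJS": 4726,
    "DFPE": 1364, "DAC": 61, "DASES": 39,
})

# Share of each category in percent of all recorded trees.
share = counts / counts.sum() * 100

print(share.round(1))
print("Jardin + CIMETIERE:", round(share["Jardin"] + share["CIMETIERE"], 1))
```

On the full data frame, `paris_trees["DOMANIALITE"].value_counts(normalize=True)` would give the same shares directly.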

In [12]:
# Subset Paris trees into alignement and jardin

alignement = paris_trees.loc[
    paris_trees["DOMANIALITE"] == "Alignement",:]
jardin = paris_trees.loc[
    paris_trees["DOMANIALITE"] == "Jardin",:]
In [13]:
# Plot visualization
plot(alignement, "GENRE", display=["Pie Chart", "Value Table"])
Out[13]:
[DataPrep.EDA report: pie chart of GENRE for alignement trees, with value table]

| Value | Count | Frequency (%) |
|---|---|---|
| Platanus | 35686 | 33.6% |
| Aesculus | 16185 | 15.2% |
| Tilia | 12040 | 11.3% |
| Sophora | 8633 | 8.1% |
| Acer | 5808 | 5.5% |
| Celtis | 3040 | 2.9% |
| Corylus | 2363 | 2.2% |
| Pyrus | 2329 | 2.2% |
| Fraxinus | 2278 | 2.1% |
| Prunus | 1861 | 1.8% |
| Other values (73) | 15983 | 15.0% |

Description: Alignment trees are dominated by the genus Platanus (plane trees), which accounts for more than 30% of alignment trees across Paris. Aesculus (horse chestnut) and Tilia (lime) are less frequent but still well represented, at about 15% and 11% respectively. Below are some images of these genera, which are actually familiar to us.

Platanus:

[Two images of Platanus trees]

Aesculus:

[Image of an Aesculus tree]
In [14]:
# Plot Pie chart
plot(jardin, "GENRE", display=["Pie Chart", "Value Table"])
Out[14]:
[DataPrep.EDA report: pie chart of GENRE for jardin trees, with value table]

| Value | Count | Frequency (%) |
|---|---|---|
| Tilia | 4974 | 10.2% |
| Acer | 4724 | 9.7% |
| Pinus | 3857 | 7.9% |
| Prunus | 3400 | 6.9% |
| Aesculus | 2878 | 5.9% |
| Quercus | 2403 | 4.9% |
| Platanus | 2045 | 4.2% |
| Sophora | 1537 | 3.1% |
| Betula | 1509 | 3.1% |
| Fagus | 1492 | 3.0% |
| Other values (156) | 20114 | 41.1% |

Description: In the gardens ("Jardin"), trees are much more diversified and no single genus dominates. In the pie chart, the top genera take slices of similar size in the upper half, while one big slice in the lower half groups all the minority genera (41.1% of garden trees).
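One simple way to put a number on "no genus dominates" is to compare the combined share of the three most common genera in each group. This is an illustrative measure of my own choosing, computed from the counts in the two value tables above; the function `top_share` is hypothetical.

```python
# Counts copied from the two GENRE value tables (top three genera and totals).
alignement_top3 = {"Platanus": 35686, "Aesculus": 16185, "Tilia": 12040}
jardin_top3 = {"Tilia": 4974, "Acer": 4724, "Pinus": 3857}
alignement_total = 106206
jardin_total = 48933


def top_share(top_counts, total):
    """Percentage of trees that belong to the listed genera."""
    return 100 * sum(top_counts.values()) / total


alignement_share = top_share(alignement_top3, alignement_total)
jardin_share = top_share(jardin_top3, jardin_total)

print(f"Top 3 genera, alignement: {alignement_share:.1f}%")
print(f"Top 3 genera, jardin:     {jardin_share:.1f}%")
```

The smaller the share held by the top genera, the more diversified the group.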

Another POV: all these trees and genera differ in age, so how are they distributed by size, which we use here as a rough proxy for age?

In [15]:
# Plot Statistics
plot(paris_trees, "HAUTEUR (m)", display=["Stats"])
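Out[15]:
[DataPrep.EDA report: statistics of HAUTEUR (m)]

Overview

| Statistic | Value |
|---|---|
| Approximate Distinct Count | 42.0135 |
| Approximate Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 3291776 |
| Mean | 8.7805 |
| Minimum | 0 |
| Maximum | 86 |
| Zeros | 25926 |
| Zeros (%) | 12.6% |
| Negatives | 0 |
| Negatives (%) | 0.0% |

Quantile Statistics

| Statistic | Value |
|---|---|
| Minimum | 0 |
| 5-th Percentile | 0 |
| Q1 | 5 |
| Median | 8 |
| Q3 | 12 |
| 95-th Percentile | 20 |
| Maximum | 86 |
| Range | 86 |
| IQR | 7 |

Descriptive Statistics

| Statistic | Value |
|---|---|
| Mean | 8.7805 |
| Standard Deviation | 5.9339 |
| Variance | 35.2114 |
| Sum | 1.8065e+06 |
| Skewness | 0.5655 |
| Kurtosis | 0.2438 |
| Coefficient of Variation | 0.6758 |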

Description: Looking at the statistics, more than 12% of our trees have a recorded height of zero. This may mean that they were recently planted (less than 1 meter tall), although a zero could also stand for a missing measurement. The median is 8, meaning that about half of the trees are at least 8 meters tall. The mean of about 8.8 is close to the median and the skewness is small and positive (0.57), so the height distribution is roughly symmetric with a slightly longer right tail: most trees cluster around 8 to 9 meters.
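Because the zeros weigh on the summary statistics, it can help to see how the median moves when they are set aside. The sketch below uses a toy series (hypothetical values) and treats zeros as unmeasured, which is an assumption and not something the data source states.

```python
import numpy as np
import pandas as pd

# Toy heights in meters; two zeros stand for trees with no usable height.
heights = pd.Series([0, 0, 5, 8, 10, 12, 27], name="HAUTEUR (m)")

# Share of zero heights in the sample.
zero_share = (heights == 0).mean()

# Assumption: a zero means "not measured", so it is replaced by NaN.
measured = heights.replace(0, np.nan)

median_all = heights.median()        # zeros included
median_measured = measured.median()  # NaN values are skipped by default

print(f"zeros: {zero_share:.1%}, median with zeros: {median_all}, "
      f"median without zeros: {median_measured}")
```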

In [16]:
# More than 50% of our trees are mature (circumference of 70 cm or more)
plot(paris_trees, "CIRCONFERENCE (cm)", display=["Stats"])
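Out[16]:
[DataPrep.EDA report: statistics of CIRCONFERENCE (cm)]

Overview

| Statistic | Value |
|---|---|
| Approximate Distinct Count | 460.6149 |
| Approximate Unique (%) | 0.2% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 3291776 |
| Mean | 81.0143 |
| Minimum | 0 |
| Maximum | 2246 |
| Zeros | 20121 |
| Zeros (%) | 9.8% |
| Negatives | 0 |
| Negatives (%) | 0.0% |

Quantile Statistics

| Statistic | Value |
|---|---|
| Minimum | 0 |
| 5-th Percentile | 0 |
| Q1 | 30 |
| Median | 70 |
| Q3 | 115 |
| 95-th Percentile | 200 |
| Maximum | 2246 |
| Range | 2246 |
| IQR | 85 |

Descriptive Statistics

| Statistic | Value |
|---|---|
| Mean | 81.0143 |
| Standard Deviation | 62.9701 |
| Variance | 3965.2312 |
| Sum | 1.6668e+07 |
| Skewness | 1.3598 |
| Kurtosis | 9.7232 |
| Coefficient of Variation | 0.7773 |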

Description: For the circumference of these trees, we observe a mean of 81 cm and a standard deviation of about 63 cm. This large spread is partly due to a few outliers of unknown origin: the maximum value is 2246 cm, which is obviously an irregularity. The 95th percentile, by contrast, is a plausible 200 cm.
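A common rule of thumb for flagging such outliers is Tukey's fence, which marks values more than 1.5 times the IQR above the third quartile. The notebook does not use this rule; the sketch below applies it to the reported quartiles (Q1 = 30, Q3 = 115) and to a toy series made of the circumferences from `head()` plus the reported maximum.

```python
import pandas as pd

# Quartiles reported in the statistics table for CIRCONFERENCE (cm).
q1, q3 = 30, 115
iqr = q3 - q1

# Tukey's upper fence: values above it are flagged as potential outliers.
upper_fence = q3 + 1.5 * iqr

# Toy sample: the five circumferences from head() and the reported maximum.
circ = pd.Series([205, 20, 0, 75, 90, 2246], name="CIRCONFERENCE (cm)")
outliers = circ[circ > upper_fence]

print("upper fence:", upper_fence)
print("flagged:", outliers.tolist())
```

Note that a fence is only a screening device: a 205 cm tree is below it here, while 2246 cm is far above it.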

In [17]:
# Plot Scatter Plot
fig = px.scatter(paris_trees.sample(5000, random_state=1), x="CIRCONFERENCE (cm)", y='HAUTEUR (m)')
fig.show()

Description: The scatter plot of these two variables indicates a positive correlation between them, which is expected: the larger the circumference of a tree, the taller we expect it to be.

In [18]:
# Plot Heatmap
plot_correlation(paris_trees, display=["Pearson"])
Out[18]:
[DataPrep.EDA report: Pearson correlation heatmap]

Description of height and circumference: According to this heatmap, circumference and height have a Pearson coefficient of about 0.8. This does not mean that a larger circumference goes with a greater height 80% of the time; the coefficient measures the strength of the linear association between the two variables. Its square, 0.64, says that a straight-line fit on circumference accounts for about 64% of the variance in height. Part of the remaining variance can plausibly be explained by differences between genera: some genera may reach their maximum circumference while staying shorter than others.
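To make the reading of a Pearson coefficient concrete, here is a small sketch that computes r and its square on a toy frame. The first four rows reuse the non-zero circumference and height pairs from `head()`; the fifth row is invented for illustration, so the resulting r is not the notebook's value.

```python
import pandas as pd

# Toy sample: four pairs from head() and one invented pair (last row).
toy = pd.DataFrame({
    "CIRCONFERENCE (cm)": [20, 75, 90, 205, 120],
    "HAUTEUR (m)": [2, 15, 10, 27, 12],
})

# Pearson correlation between the two columns (the default method of corr).
r = toy["CIRCONFERENCE (cm)"].corr(toy["HAUTEUR (m)"])

# The square of r is the share of variance captured by a straight-line fit.
r_squared = r ** 2

# Applying the same reading to the coefficient reported by the heatmap.
reported_r = 0.8
reported_r_squared = reported_r ** 2

print(f"toy r = {r:.2f}, toy r^2 = {r_squared:.2f}")
print(f"reported r = {reported_r}, reported r^2 = {reported_r_squared:.2f}")
```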

Description of other variables: The remaining variables show very weak correlations, suggesting there is no direct linear relationship among latitude, longitude, and district.

Another POV: Now that we have studied the types of trees, their genera and the correlations between variables, let us examine their location across the Paris districts.

In [19]:
# Subsetting and Map Display

arrondissement_paris = pd.DataFrame(pd.pivot_table(paris_trees, index=["ARRONDISSEMENT"] 
                                                   ,aggfunc="size"), columns=["Occurences"])
arrondissement_rec = pd.DataFrame(pd.pivot_table(claims, index=["ARRONDISSEMENT"] ,aggfunc="size"),
                                  columns=["Occurences"])
[Map: number of trees per Paris district]

Description: The map above summarizes the distribution of trees across Paris. The greener the district, the more trees it has, ranging from 500 trees to more than 25k in the greenest district. We notice that the outer districts, from the 12th to the 20th (last) one, are richer in trees than the inner ones. The City of Paris runs an application called "Dans ma Rue", whose purpose is to let residents report online to the municipality any anomaly they see on the streets. Among all of these reports, we will study the distribution of complaints regarding trees and plants.

In [20]:
# Plot Value Table
plot(claims, "ANNEE DECLARATION", display=["Value Table"])
Out[20]:
[DataPrep.EDA report: value table of ANNEE DECLARATION]

| Value | Count | Frequency (%) |
|---|---|---|
| 2022 | 3837 | 50.7% |
| 2021 | 3736 | 49.3% |

Description: According to this value table, our historical data is up to date and covers the past two years of complaints, split almost evenly between 2021 (49.3%) and 2022 (50.7%).

[Map: number of complaints per Paris district]

Description: The map of complaints shows a color gradient similar to that of the map of trees: the redder the area, the more complaints it contains. Complaints range from 80 to more than 800 per district. The 15th district has the most complaints in all of Paris. But why is that?
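The two maps rest on per-district counts, which can be put side by side in one table. The following is a minimal sketch on toy frames (the names `trees`, `reports` and `summary` are hypothetical, with one row per tree and one row per complaint); it is not the code that produced the maps.

```python
import pandas as pd

# Toy stand-ins for paris_trees and claims.
trees = pd.DataFrame({"ARRONDISSEMENT": [12, 15, 15, 15, 20, 1]})
reports = pd.DataFrame({"ARRONDISSEMENT": [15, 15, 20]})

# Count rows per district in each frame, then align the counts on the district.
summary = pd.concat(
    [trees.groupby("ARRONDISSEMENT").size().rename("trees"),
     reports.groupby("ARRONDISSEMENT").size().rename("complaints")],
    axis=1,
).fillna(0).astype(int)  # districts without complaints get 0

# District with the largest number of complaints.
top_district = summary["complaints"].idxmax()

print(summary)
print("most complaints:", top_district)
```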

In [21]:
# Subsetting

paris_lat_long = paris_trees.drop(["LIEU / ADRESSE", "LIBELLE FRANCAIS", "geo_point_2d"], axis = 1)
paris_ext= paris_lat_long.loc[(paris_lat_long["ARRONDISSEMENT"] >= 12), :]
paris_int= paris_lat_long.loc[(paris_lat_long["ARRONDISSEMENT"] < 12), :]
In [22]:
# Cemeteries = 40% > 35% of the outer zone
plot(paris_int, "DOMANIALITE", display=["Pie Chart"])
Out[22]:
[DataPrep.EDA report: pie chart of DOMANIALITE for the interior districts]

Description: The pie chart above summarizes the types of tree locations in the interior districts, from the 1st to the 11th. Around 40% of trees are in cemeteries and another 40% are alignment trees, while gardens are a minority at 13%. The small number of complaints in these districts suggests, though it does not prove, that cemeteries and alignment trees are in good shape.

In [23]:
# Plot Pie Chart
plot(paris_ext, "DOMANIALITE", display=["Pie Chart"])
Out[23]:
[DataPrep.EDA report: pie chart of DOMANIALITE for the exterior districts]

Description: The pie chart above summarizes the types of tree locations in the exterior part of Paris, districts 12 to 20. About 55% are alignment trees and, surprisingly, about 30% are in gardens. The large number of complaints there might thus be explained by poorly kept gardens: gardens not well maintained by the municipality, overgrown vegetation, and so on.
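The interior and exterior shares can also be compared in a single table instead of two pie charts. The sketch below uses `pd.crosstab` on a toy frame with invented rows; the zone labels "interior" and "exterior" and the threshold of 12 follow the subsetting code above.

```python
import pandas as pd

# Toy rows standing in for paris_trees.
toy = pd.DataFrame({
    "ARRONDISSEMENT": [1, 3, 11, 12, 13, 16, 20, 20],
    "DOMANIALITE": ["CIMETIERE", "Alignement", "Jardin", "Alignement",
                    "Alignement", "Jardin", "Jardin", "Alignement"],
})

# Districts 12 and above are "exterior", the others "interior".
zone = toy["ARRONDISSEMENT"].ge(12).map({True: "exterior", False: "interior"})
zone = zone.rename("zone")

# Row-normalized cross table: each row sums to 100%.
shares = pd.crosstab(zone, toy["DOMANIALITE"], normalize="index") * 100

print(shares.round(1))
```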

In [24]:
# Subtype declaration word cloud display
claims_dec = pd.read_csv("data_csv/dans-ma-rue_v2.csv", low_memory= False, sep=";")
claims_dec_ext = claims_dec.loc[claims_dec["ARRONDISSEMENT"] >= 12,:]
plot(claims_dec_ext, "SOUS TYPE DECLARATION", display=["Word Cloud"])
Out[24]:
[DataPrep.EDA report: word cloud of SOUS TYPE DECLARATION for districts 12 to 20]

Description: The word cloud above summarizes the complaints posted by users of the application. We see the words "arbre", "herbes", "animaux", "insecteprésence", "jardiniere", "animal", "rat", ... This vocabulary is consistent with what we suggested previously: gardens are not well maintained. So we ask the question: what is the future of Paris forestry?
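A word cloud is a picture of word frequencies, so the same information can be read as plain counts. In the sketch below the complaint labels are hypothetical examples, not values taken from the "SOUS TYPE DECLARATION" column, and the tokenization is deliberately simple.

```python
import re
from collections import Counter

# Hypothetical complaint sub-types, for illustration only.
subtypes = [
    "Arbre dangereux",
    "Herbes hautes",
    "Présence de rats",
    "Arbre tombé",
]

# Lowercase each label and split it into words (accented letters included).
words = []
for label in subtypes:
    words.extend(re.findall(r"\w+", label.lower()))

word_counts = Counter(words)

print(word_counts.most_common(3))
```

Counts make it possible to quote how often a word such as "arbre" appears instead of judging it by its size in the cloud.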

Conclusion: The Dans ma Rue application has helped gather data about the areas that most need care. What remains now is to act upon this data and save Paris forestry!